Chapter 1
Introduction 1 



Imagine sitting in a cafe (or another public venue) when a song you really like comes on the stereo, but
you don't know its name. There's a solution for that. Software song identification has been a topic of
interest for years. However, it is computationally difficult to tackle this problem using conventional
algorithms. Frequency analysis provides a fast and accurate solution, and we decided to build our project
around it. The main purpose of our project was to accurately match a noisy song segment with a song in our
song library. The company Shazam was our main inspiration, and we started out by studying how Shazam works.



1 This content is available online at <http://cnx.org/content/m33185/1.2/>.



Chapter 2 

The Fingerprint of a Song 1 



Just as every individual has a unique fingerprint that distinguishes one person from another, our
algorithm creates a digital fingerprint for each song that can be used to tell two songs apart.
The song's fingerprint consists of a list of time-frequency pairs that uniquely represent all the significant
peaks in the song's spectrogram. To ensure accurate matching between two fingerprints, our algorithm needs to
take into account the following issues when choosing peaks for the fingerprint:

• Uniqueness - The fingerprint of each song needs to be unique to that one song. Fingerprints of 
different songs need to be different enough to be easily distinguished by our scoring algorithm. 

• Sparseness - The computational time of our matched filter depends on the amount of data in each
song's fingerprint. Thus each fingerprint needs to be sparse enough for fast results, but still contain enough
information to provide accurate matches.

• Noise Resistant - Song data may contain large amounts of background noise. The fingerprinting 
algorithm must be able to differentiate between the signal and added noise, storing only the signal 
information in the fingerprint. 

These criteria are all met by identifying major peaks in the song's spectrogram. The following section 
describes the fingerprinting algorithm in more detail. 



1 This content is available online at <http://cnx.org/content/m33186/1.2/>.



Chapter 3 

The Fingerprint Finding Algorithm 



3.1 Filtering and Resampling 

After the song data is imported, the signal is resampled to 8000 samples per second in order to reduce
the number of columns in the spectrogram. This speeds up later computations while still leaving enough
resolution in the data for accurate results.

The data is then high-pass filtered using a 30th-order filter with a cutoff frequency around 2 kHz (half
the bandwidth of the resampled signal). Filtering is used because the higher frequencies in songs are more
unique to each individual song. The bass, however, tends to overshadow these frequencies, so the filter
is used to make the fingerprint include more high-frequency points. Testing has shown that the algorithm has
a much easier time distinguishing songs after they are high-pass filtered.
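
One way such a filter could be designed is sketched below. The report gives only the order (30), the cutoff (about 2 kHz) and the sample rate (8000 samples per second); the FIR structure, the Hamming-windowed sinc design, and the names highpass_fir and gain_at are assumptions made for this illustration and may differ from what the project's Matlab code did.

```python
import math

def highpass_fir(order=30, cutoff_hz=2000.0, fs_hz=8000.0):
    """Design a linear-phase FIR high-pass filter by spectral inversion
    of a Hamming-windowed sinc low-pass (an assumed design method)."""
    if order % 2 != 0:
        raise ValueError("order must be even for this design")
    mid = order // 2
    wc = 2.0 * cutoff_hz / fs_hz  # cutoff as a fraction of the Nyquist rate
    lowpass = []
    for n in range(order + 1):
        k = n - mid
        ideal = wc if k == 0 else math.sin(math.pi * wc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / order)
        lowpass.append(ideal * window)
    dc_gain = sum(lowpass)
    lowpass = [h / dc_gain for h in lowpass]  # unit gain at 0 Hz
    highpass = [-h for h in lowpass]
    highpass[mid] += 1.0  # unit impulse minus the low-pass response
    return highpass

def gain_at(taps, freq_hz, fs_hz=8000.0):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    re = sum(h * math.cos(w * n) for n, h in enumerate(taps))
    im = -sum(h * math.sin(w * n) for n, h in enumerate(taps))
    return math.hypot(re, im)

taps = highpass_fir()
print(len(taps), gain_at(taps, 500.0), gain_at(taps, 3500.0))
```

A 30th-order FIR filter has 31 taps. With this design the gain near 500 Hz is close to zero and the gain near 3500 Hz is close to one, which is the bass-suppressing behavior the text describes.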

3.2 The Spectrogram 

The spectrogram of the signal is then taken in order to view the frequencies present in each time slice. The 
spectrogram below is from a 10 second noisy recording. 



1 This content is available online at <http://cnx.org/content/m33188/1.4/>.







Spectrogram 




Figure 3.1: The effect of the high-pass filter is clearly visible in the spectrogram. However, local maxima
in the low frequencies still exist and will still show up in the fingerprint.



Each vertical time slice of the spectrogram is then analyzed for prominent local maxima as described in the
next section.



3.3 Finding the Local Maxima 

In the first time slice, the five greatest local maxima are stored as points in the fingerprint. Then a threshold 
is created by convolving these five maxima with a Gaussian curve, creating a different value for the threshold 
at each frequency. An example threshold is shown in the figure below. The threshold is used to spread out 
the data stored in the fingerprint, since peaks that are close in time and frequency are stored as one point. 



Threshold for First Time Bin 




Frequency (Bins) 



Figure 3.2: The initial threshold, formed by convolving the peaks in the first time slice with a Gaussian 
curve. 



For each of the remaining time slices, up to five local maxima above the threshold are added to the fingerprint.
If there are more than five maxima, then the five greatest in amplitude are chosen. The threshold is then 
updated by adding new Gaussian curves centered at the frequencies of the newly found peaks. Finally the 
threshold is scaled down so that it decays exponentially over time. The following figure shows how the 
threshold changes over time. 

Change of Threshold over Time

Figure 3.3: The threshold rises around a peak's frequency whenever a new peak is found and
decays exponentially over time.



The final list of the times and frequencies of the local maxima above the threshold is returned as the
song's fingerprint.
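
The peak-picking loop described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's Matlab code: the Gaussian width sigma, the decay factor, and the function names are hypothetical, the spectrogram is assumed to hold non-negative magnitudes, and the decay is applied after every slice including the first.

```python
import math

def local_maxima(column):
    """Indices of bins strictly greater than both neighbours."""
    return [f for f in range(1, len(column) - 1)
            if column[f] > column[f - 1] and column[f] > column[f + 1]]

def fingerprint(spectrogram, max_peaks=5, sigma=2.0, decay=0.9):
    """Return (time, frequency) index pairs of prominent peaks.

    spectrogram[t][f] is the magnitude at time slice t, frequency bin f.
    sigma (Gaussian width in bins) and decay (per-slice scale factor)
    are assumed values, not taken from the report.
    """
    if not spectrogram:
        return []
    n_bins = len(spectrogram[0])
    threshold = [0.0] * n_bins  # zero at first, so slice 0 keeps its top maxima
    peaks = []
    for t, column in enumerate(spectrogram):
        # Local maxima that rise above the current threshold.
        candidates = [f for f in local_maxima(column) if column[f] > threshold[f]]
        # Keep at most max_peaks of them, largest amplitude first.
        candidates.sort(key=lambda f: column[f], reverse=True)
        chosen = sorted(candidates[:max_peaks])
        for f0 in chosen:
            peaks.append((t, f0))
            # Raise the threshold with a Gaussian centred on the new peak.
            for f in range(n_bins):
                threshold[f] += column[f0] * math.exp(-((f - f0) ** 2) / (2 * sigma ** 2))
        # Scale the threshold down so it decays exponentially over time.
        threshold = [decay * v for v in threshold]
    return peaks
```

For example, a strong peak in one slice raises the threshold around its frequency bin, so a weaker peak at the same bin in the next slice is skipped while a peak far away in frequency is still recorded.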



Chapter 4 

The Resulting Fingerprint 1 



The following is the fingerprint of the sample signal from the examples above. 




Figure 4.1: The fingerprint of the 10 second segment from the previous examples 



From the graph, it is easy to see patterns and different notes in the song. Let's see how the algorithm
addresses the three issues identified earlier:



1 This content is available online at <http://cnx.org/content/m33189/1.4/>.

• Uniqueness - The algorithm only stores the prominent peaks in the spectrogram. Different songs have 
a different pattern of peaks in frequency and time, thus each song will have a unique fingerprint. 

• Sparseness - The algorithm picks at most five peaks per time slice. This limits the number
of peaks in the resulting fingerprint. The threshold spreads out the positions of peaks so that the
fingerprint is more representative of the data.

• Noise Resistant - Unless the background noise is loud enough to create peaks greater than the peaks
present in the song, very little noise will show up in the fingerprint. Also, a ten second segment has
around 6000 data points, so a matched filter will be able to detect a match between two fingerprints,
even with a reasonable amount of added noise.

The next section will detail the process used to compare the fingerprint of the song segment to the fingerprints 
of the songs in the library. 



Chapter 5 

Matched Filter for Spectrogram Peaks 1 



In order to compare songs, we generate match scores for them using a matched filter. We wanted a
filter that takes the spectral peak information generated by the fingerprint finding algorithm for two
different songs and produces a single number telling us how much the two songs resemble each other.
We wanted this filter to be as insensitive as possible to noise and to produce a score that is independent
of the length of each recording.

Our approach to this was completely different from that used by the creators of Shazam, as we did not
use hash tables at all and did not combine the peaks into pairs limited by certain regions, as they did. In
the end we still managed to get very good accuracy and decent performance by using a matched filter.



1 This content is available online at <http://cnx.org/content/m33191/1.1/>.

Chapter 6

The Matched Filter Algorithm 



6.1 Preparation 

Before filtering, we take the lists of spectral peaks output by the fingerprint finding (landmark generator)
algorithm and generate matrices of the same size as the spectrograms, with 1's at the positions of the peaks
and 0's everywhere else. This gives us one map per song showing the position, in time and frequency bins, of
every peak. At one point during the project we considered convolving this matrix with a Gaussian curve, so
that peaks could still match somewhat if they were shifted only slightly. However, we later determined that
even a very small Gaussian worsened our noise resistance, so this idea was dropped.

Next we normalize each matrix by its Frobenius norm, which ensures that the final score is normalized. Then
we apply the matched filter, which consists of flipping one of the matrices and convolving the two. For
speed, this is done by zero-padding both matrices to the proper size and multiplying their 2D FFTs. The
result is a cross-correlation matrix, from which we still need to extract a single number to serve as our
match score.
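
The computation above can be illustrated with a small sketch. It does not use FFTs: because the peak maps contain only 0's and 1's, the cross-correlation value at a given shift is simply the number of peak pairs separated by that shift, so the same maximum can be found by counting shifts directly. The function name match_score and the choice of direct counting are assumptions made for this illustration; the project itself multiplied 2D FFTs in Matlab.

```python
import math
from collections import Counter

def match_score(peaks_a, peaks_b):
    """Maximum of the normalized 2-D cross-correlation of two binary peak maps.

    Each argument is a collection of (time_bin, freq_bin) pairs. For a 0/1
    map the Frobenius norm equals sqrt(number of peaks), and the
    cross-correlation at shift (dt, df) equals the number of peak pairs
    separated by exactly that shift.
    """
    a, b = set(peaks_a), set(peaks_b)
    if not a or not b:
        return 0.0
    shifts = Counter((ta - tb, fa - fb) for ta, fa in a for tb, fb in b)
    best = max(shifts.values())  # global maximum of the correlation
    return best / math.sqrt(len(a) * len(b))

song = [(0, 3), (1, 7), (2, 5), (4, 1), (5, 6), (7, 2)]
fragment = [(0, 7), (1, 5), (3, 1)]  # three of the song's peaks, shifted in time
print(match_score(song, song), match_score(fragment, song))
```

Direct counting costs time proportional to the product of the two peak counts, so for long recordings the FFT route described in the text would likely be faster; the sketch is meant only to show what the maximum of the cross-correlation measures.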

6.2 Extracting Information from the Cross Correlation Matrix 

Through much testing, we determined that the most accurate and noise-resistant measure of the match was
simply the global maximum of the cross-correlation matrix. Other approaches that we tried, such as taking the
trace of $X^T X$ or the sum of the maxima of each row or column, produced mismatches much more frequently.
Taking just the global maximum of the whole matrix was simple and extremely effective.

When looking at test results, however, we saw that the score still depended on the size of the segments
being compared. Through more testing with noiseless fragments of a larger song, we determined that the
score grows approximately as the square root of the ratio of the smaller number of peaks to the larger
number of peaks. This can be seen in this plot:



1 This content is available online at <http://cnx.org/content/m33193/1.3/>.



Match Scores Dependency on Number of Peaks (vertical axis: score, 0 to 0.9; horizontal axis: Number of Peaks, 500 to 5000)



Figure 6.1: A plot showing the score of a song fragment that should perfectly match the song it was 
taken from, seen without correcting the square root dependency mentioned above 



In the plot above, the original segment has 6915 peaks and the fragment was tested with between 100
and 5000 peaks, in steps of 100. Since shorter samples usually have fewer peaks, we had to remove
this dependency. To cancel the square-root growth of the scores, the final score is multiplied
by the inverse of this square root, yielding a match score that is approximately independent of sample size.
This can be seen in the next stem plot, made with the same segments as the first:

Figure 6.2: The same plot shown before, but with the square root dependency on number of peaks
removed



This correction gives us better match scores for small song segments. After this process, we had
a normalized score that was approximately independent of segment size and could tell matches and
mismatches apart, even with a lot of noise. All that was left was to test it against different sets of data
and set a threshold for distinguishing between matches and non-matches.
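
The square-root dependency can be motivated with a short idealized calculation. The symbols $N_s$ (peaks in the full song), $N_f$ (peaks in the fragment) and $s$ (raw score) are introduced here for illustration, and the derivation assumes a noiseless fragment whose peaks all coincide with peaks of the song at the correct shift.

```latex
% Binary peak maps: the Frobenius norm is the square root of the peak count.
\[
  \|A\|_F = \sqrt{N_s}, \qquad \|B\|_F = \sqrt{N_f}
\]
% At the correct shift all N_f fragment peaks line up, so the raw score is
\[
  s = \frac{N_f}{\sqrt{N_s}\,\sqrt{N_f}} = \sqrt{\frac{N_f}{N_s}}
\]
% Multiplying by the inverse of the square root removes the dependency:
\[
  s \cdot \sqrt{\frac{N_s}{N_f}} = 1
\]
% Example with the counts quoted in the text: N_s = 6915, N_f = 1000
\[
  s = \sqrt{1000 / 6915} \approx 0.38
\]
```

Under these assumptions a perfectly matching fragment with 1000 peaks would score only about 0.38 before the correction and 1 after it, which agrees with the trend the text describes.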



6.3 Setting a Threshold 

The filter's behavior proved to be very consistent. Perfect matches (matching a segment with itself)
always scored 1. Matching noiseless segments to the whole song usually yielded scores in the upper
0.8's or in the 0.9's, with a few rare exceptions that could have been caused by a bad choice of segment,
such as one with a long period of silence. Noisy segments usually gave us low scores, in the 0.1's,
but more importantly mismatches were even lower, around 0.05 to 0.07 or so. This allowed us to
set a threshold for deciding whether or not we have a match.

During our testing, we considered using a statistical approach to set the threshold. For example, if we
wanted 95% certainty that a song matched, we could require the highest match score to be greater than
$1.66\,(\sigma/\sqrt{n}) + \mu$, where $\sigma$ is the standard deviation, $n$ is the sample size and $\mu$
is the mean. However, with our very small sample size, this threshold seemed to yield inaccurate results,
so we used a simpler criterion instead: the highest match score must be at least 1.5 times the second
highest for the segment to be considered a match.

6.4 Similarities and Differences from Shazam's Approach 

Although we followed the ideas in the paper by Wang, our approach still differs significantly from
the one used by Shazam. We followed their ideas for fingerprint creation to a certain extent,
but the company uses hash tables instead of matched filters to perform the comparison. While evidently
faster than a matched filter, hash tables are not covered in ELEC 301. Furthermore, when making a
hash, Wang says they combine several points in an area with an anchor point and pair them up
combinatorially. This allows the identification of a time offset to be used with the hash tables and makes
the algorithm even faster and more robust. Investigating this would be an interesting extension of the
project, given more time.



Chapter 7 
Results 1 



The final step in the project was to test the algorithm we had created, so we conducted a
series of tests that evaluated mostly correctness but also, to some extent, performance.

7.1 Testing 

First, we wanted to make sure that our algorithm was working properly. To do this, we attempted to
match short "noiseless" segments (actual copies of the library songs) of approximately ten seconds in
length. The table below shows how these original clips matched. The titles running from left to right
are song segments, and the titles running from top to bottom are library songs. We abbreviated them from
the original, so they would fit in the matrix. The original names are "Stop this Train", by John Mayer,
"Semi-Charmed Life", by Third Eye Blind, "I've got a Feeling", by Black Eyed Peas, "Love Like Rockets", by
Angels and Airwaves, "Crash Into Me", by Dave Matthews Band and "Just Another Day in Paradise", by
Phil Vassar.



Noiseless Recordings

            Train    Life     Feeling  Rockets  Crash    Paradise
Train       0.8685   0.0611   0.0876   0.0695   0.0886   0.0803
Life        0.0987   0.2914   0.0869   0.071    0.0725   0.0736
Feeling     0.1091   0.0679   0.9292   0.0695   0.0687   0.075
Rockets     0.0822   0.0691   0.0855   0.9488   0.0648   0.085
Crash       0.1255   0.0593   0.0967   0.0649   0.9031   0.0716
Paradise    0.1151   0.0722   0.096    0.0756   0.0777   0.9398


Figure 7.1: This matrix shows the match score results of the six noiseless recordings made from 
fragments of songs in the database, each of them compared to all songs in the database 



1 This content is available online at <http://cnx.org/content/m33194/1.3/>.



The clear matches with the highest scores can be seen along the diagonal. Most of these are close to 1, and
each match meets our criterion of being at least 1.5 times greater than the other scores (comparing
horizontally). This was a good test that we were able to use to modify our algorithm and try different
techniques. Ultimately, the above results showed that our code was sufficient for our needs.

We then needed to see if our code actually worked with real-world (noisy) song segments. Songs were
recorded on an iPhone simultaneously with various types of noise, as follows: Train, low-volume talking;
Life, loud recording (clipping); Crash, typing; Rockets, a repeating computer error noise; Feeling, Gaussian
noise (added in Matlab to the wav file); and Paradise, very loud talking. We used two additional songs in
this test to check for robustness and proper matching. One is a live version of Crash, which includes a lot
of crowd noise but does not necessarily have all the identical features of the original Crash fingerprint. The
other additional song, "Yellow", by Coldplay, is a song that is not in our library at all.



Noisy Recordings Using an iPhone

            Train    Life     Feeling  Rockets  Crash    Paradise  Crash (Live)  Yellow
Train       0.1385   0.0631   0.0705   0.0774   0.0892   0.0661    0.0598        0.0694
Life        0.0612   0.1269   0.0721   0.0842   0.0809   0.0646    0.0604        0.0666
Feeling     0.055    0.0623   0.3468   0.0867   0.0734   0.063     0.0586        0.0637
Rockets     0.0619   0.0764   0.0755   0.1759   0.0679   0.0764    0.058         0.066
Crash       0.0675   0.0631   0.0733   0.0741   0.1619   0.0669    0.0682        0.0711
Paradise    0.0675   0.0734   0.0738   0.0934   0.0837   0.1189    0.0622        0.0694




Figure 7.2: This matrix shows the match score results of the six noisy recordings made from fragments 
of songs in the database, plus a live version of a song in the database and another song entirely not in 
the database 



Again, the clear matches are highlighted in yellow along the diagonal. The above results show that our 
algorithm can still accurately match the song segments in more realistic conditions. The graph below shows 
more interesting results. 

Figure 7.3: This plot is a visual representation of the results matrix seen above



7.2 Conclusions 

As before, the matches in the first six songs (from left to right) are obvious, and Yellow does not show any
clear correlation to any library song, as desired, but the live version of Crash presents an interesting question.
Do we actually want this song to match? Since we wanted our fingerprinting method to be unique to each
song and song segment, we decided it would be best to have a non-match in this scenario. However, a
close look shows that the closest match (though it is definitely not above the 1.5 mark) is,
in fact, the original Crash. This emerges as a small feature of our results. This small "match"
says that although the segment may not match any song in the library, we can tell you that this live version
most resembles the original Crash, which may be a desirable outcome if we were to market this project.

We were amazed that the final filter could perform so well. The idea of completely ignoring amplitude
information in the filter came from the paper by Avery Li-Chun Wang, one of Shazam's developers. As he
mentions, discarding amplitude information makes the algorithm more insensitive to equalization. However,
this approach also makes it more noise resistant, since what we do from there on basically consists of
counting matching peaks versus non-matching peaks. Any leftover noise will count very little towards the
final score, as the number of peaks per area in the spectrogram is limited by the thresholding algorithm and
all peaks have the same magnitude in the filter.

Chapter 8

Digital Song Analysis Using Frequency Analysis

Summary: This module is the final report of the ELEC 301 project done by the group Curtis Thompson, 
Dante Soares, Shahzaib Shaheen and Yilong Yao. In this project, we created a Matlab program capable of 
identifying a noisy segment of a song recorded from a set of songs in a database. 